Natural language processing systems for extracting information from electronic health records about activities of daily living. A systematic review

Abstract Objective Natural language processing (NLP) can enhance research on activities of daily living (ADL) by extracting structured information from unstructured electronic health record (EHR) notes. This review aims to give insight into the state of the art, usability, and performance of NLP systems to extract information on ADL from EHRs. Materials and Methods A systematic review was conducted based on searches in Pubmed, Embase, Cinahl, Web of Science, and Scopus. Studies published between 2017 and 2022 were selected based on predefined eligibility criteria. Results The review identified 22 studies. Most studies (65%) used NLP for classifying unstructured EHR data on 1 or 2 ADL. Deep learning, combined with a rule-based method or machine learning, was the approach most commonly used. NLP systems varied widely in terms of the pre-processing and algorithms. Common performance evaluation methods were cross-validation and train/test datasets, with F1, precision, and sensitivity as the most frequently reported evaluation metrics. Most studies reported relatively high overall scores on the evaluation metrics. Discussion NLP systems are valuable for the extraction of unstructured EHR data on ADL. However, comparing the performance of NLP systems is difficult due to the diversity of the studies and challenges related to the dataset, including restricted access to EHR data, inadequate documentation, lack of granularity, and small datasets. Conclusion This systematic review indicates that NLP is promising for deriving information on ADL from unstructured EHR notes. However, which NLP system performs best depends on characteristics of the dataset, research question, and type of ADL.


Introduction
The ever-increasing amount of data recorded by physicians or nursing staff in patients' electronic health records (EHRs) offers opportunities for clinical practice and research. Although EHR systems are primarily designed for documentation about individual patient care, EHR data are increasingly used for scientific research. The data used for this purpose are predominantly structured data, which are recordings following a fixed format or category. However, solely using structured EHR data for research could lead to biased results, for example, because this may lead to an underestimation of the incidence and prevalence of health problems 1,2 and to low performance of prediction models. 3,4
Using unstructured health data, such as clinical notes and discharge letters, can enhance the quality of research results by capturing valuable information not found in structured data. It is estimated that more than half of all health records in the EHR systems are unstructured data. 5 Even if health information could be recorded as structured data, healthcare professionals sometimes prefer to use unstructured free-text notes, for example, because they think it allows a more accurate representation of the patient's situation. 6,7][10] The activities of ambulating, feeding, dressing, personal hygiene, continence, and toileting are referred to as activities of daily living (ADL). 11 For care provision, adequate information on ADL is important for ensuring individuals receive the necessary daily support. Research using health data on ADL also requires adequate information to provide insight into the need for support with ADL and, for instance, the effect of a treatment on the ability to perform ADL.
ADL could be recorded in a structured way, using assessment tools such as the Barthel Index and Katz Activities of Daily Living Index. 12 The International Classification of Functioning, Disability and Health (ICF) also categorizes different daily activities as part of a larger framework on functional status. 13 Furthermore, there are ADL measures developed for a specific target population, for example, the Expanded Disability Status Scale (EDSS) (for Multiple Sclerosis) 14 and the Karnofsky Performance Status Scale (for cancer). 15[10] To extract information from unstructured EHR data, Natural Language Processing (NLP) is currently the most widely used "big data" analytical technique. 16 NLP, a subfield of artificial intelligence, focuses on the interaction between computers and human language. NLP can be used for various applications, such as information retrieval, text classification, topic identification, word frequency calculation, and sentiment analysis. 17
Advancements in computing power, greater availability of large datasets, and recent breakthroughs in the field of NLP have increased the potential for generating valuable insights from unstructured EHR data. 18 While the oldest NLP approach, the rule-based approach, relies on manual rule construction by experts, machine-learning approaches, including Support Vector Regression and Conditional Random Fields, are able to train algorithms with less manual coding. 19 Rule-based and machine-learning models generally involve a pre-processing phase to standardize text by cleaning and preparing textual data as tokens, and a modeling phase in which unstructured textual data is fed into a model.1][22] The latest breakthrough in machine learning is the deep-learning approach. Examples of deep-learning models are Word2vec 23 and transformers, such as Bidirectional Encoder Representations from Transformers (BERT). 24 In NLP, deep-learning models take a holistic approach, considering the entire context and relationships within the sentence rather than individual tokens. This enables deep-learning models to analyze complex patterns in texts. 25,26 In addition, the holistic approach avoids extensive pre-processing of texts. 27
Although the opportunities for NLP in the healthcare sector are recognized, the usability for clinical practice and research depends on how well an NLP system performs. 28 For example, overfitting is a common concern with machine-learning models. Overfitting means that an algorithm aligns too closely with a specific dataset, limiting its application to future data.][30] While previous systematic reviews have explored the processing of unstructured clinical notes (eg, 19,[31][32][33][34][35]), none have specifically focused on ADL. This gap makes it hard to draw conclusions and recommendations for using NLP to derive information on ADL from unstructured EHR notes. Understanding recent developments specific to NLP in the ADL research field is important, as it will help researchers gain a broader understanding and provide insight into methods and techniques that support new developments in the field of ADL research.

Objective and research questions
This systematic review aims to give insight into the state of the art and usability of NLP systems to extract information on ADL from EHRs. The specific review questions addressed are as follows: 1) Which NLP systems are used to extract information on ADL from routinely recorded unstructured free-text data in EHRs? 2) Which methods are used to evaluate the performance of these NLP systems in research? 3) How do the NLP systems perform with regard to extracting information on ADL from EHRs?

Design
The reporting of this systematic review was guided by the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) statement. 36

Search strategy and information sources
In November 2022, a librarian, in consultation with the authors of this paper, conducted searches in the Pubmed, Embase, Cinahl, Web of Science, and Scopus databases, using predetermined search strategies (Supplementary Appendix 1).

Eligibility criteria
The following criteria were used by the reviewers during the selection of relevant studies.
1) The study is an empirical study
All types of empirical studies were eligible for inclusion, including gray literature. Editorials, essays, literature reviews, or other non-empirical studies were excluded.

2) The full text of the study had to be available

3) The study concerns at least 1 activity of daily living
Studies had to use information on at least 1 activity (ambulating, feeding, dressing, personal hygiene, continence, or toileting). 11 There were no restrictions regarding care setting and age, health status, and type of disease of the study population.

4) Study uses information on ADL that is routinely recorded as unstructured free text in an EHR system
To be included, studies had to use information on ADL that was recorded as unstructured free text in an EHR system by a healthcare professional. Studies were excluded if information on ADL was not routinely recorded, for example, if patients recorded them in a one-time questionnaire as part of scientific research.

5) NLP system is used in the study to extract information on ADL
Studies were excluded if they only used manual processing of the unstructured free texts.

Data extraction and synthesis
For each of the 22 included studies, 1 author (D.v.K. or Y.W.J.) manually extracted information, which was then verified by another author (Y.W.J., D.v.K., R.V., A.F., or M.O.V.). The extracted data were inserted in a prestructured format, developed in consultation with all the authors. The extracted information concerned background information on the study, including the aim of the study, country, whether the unstructured records were retrieved directly from an EHR system or other database, the study population, and type of ADL. Moreover, data were extracted on the NLP system, including the type (rule-based, machine learning, or deep learning), aim (eg, classification or data extraction), pre-processing steps, and tools. Lastly, data were extracted on the evaluation of the NLP system, including the metrics used to evaluate the NLP system, the NLP system's performance, and limitations of the method according to the authors of the specific study.
To address the research question regarding which NLP systems were used, we identified the most common aim (ie, data classification or extraction) and type of NLP system (rule-based, machine learning, deep learning, or a combination of these). To analyze trends in the NLP system used, we looked at whether the type of NLP system varied over the years. Furthermore, we determined frequently used pre-processing techniques and identified studies with no or few pre-processing steps. Lastly, we described the software used for the NLP. To answer the research question about which evaluation methods were used, we looked at the most commonly used methods and compared their prevalence across different types of NLP systems. To address the research question regarding the performance, we outlined the primary performance metrics used to evaluate the NLP system.

Study characteristics
Twenty-two studies were included in the review (Table 1). The aim of the studies included is described in Table S1. The year with the most publications was 2019 (n = 6). Eleven studies were published in 2020 or later.
As 2 of the included studies 2,38 used the NLP system developed in the study by Anzaldi et al, 37 we only refer to the study of Anzaldi et al 37 in the remainder of this review. Thus, the total number of NLP systems we report on is 20.

Pre-processing
49,[52][53][54][55] In general, deep learning requires less pre-processing compared to rule-based and machine-learning models. In the 3 studies that employed deep learning only, little to no pre-processing was performed. One of the 3 studies 46 only used sentence segmentation, the second study reported no pre-processing steps, 48 and the third study used a standard tool for text-cleaning methodologies. 47 However, as the precise methodologies applied to the dataset were not specified in this last study, the extent of data pre-processing remains unclear.
Pre-processing details were not reported in 3 other studies. Anzaldi et al used a rule-based approach for identifying geriatric syndromes in EHR free-text notes as well as the explicit mention of "frailty" in the notes. 37 In such an approach, pre-processing is not always a necessity. In the study of Rivera et al, although pre-processing was not reported, we cannot conclude that data pre-processing was therefore not applied, as machine learning usually involves pre-processing. 45 Lastly, Meskers et al used BERTje to create vectors instead of pre-processing methods. 44
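To make the pre-processing steps discussed above more concrete, the following minimal Python sketch shows sentence segmentation, tokenization, and a simple rule-based match for one ADL (ambulating). It is an illustration only and is not taken from any of the reviewed studies: the keyword pattern, the function names, and the example note are all hypothetical, and the segmentation rule is deliberately naive (it would, for instance, split on abbreviations).

```python
import re

# Hypothetical keyword rule for one ADL (ambulating); not from any reviewed study.
AMBULATION_PATTERN = re.compile(
    r"\b(walk(s|ed|ing)?|ambulat(e|es|ed|ing|ion)|gait)\b"
)


def split_sentences(note: str) -> list[str]:
    """Naive sentence segmentation: split after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", note.strip())
    return [part for part in parts if part]


def tokenize(sentence: str) -> list[str]:
    """Lowercase the sentence and keep alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def ambulation_sentences(note: str) -> list[str]:
    """Return the sentences that match the hypothetical ambulation rule."""
    return [
        sentence
        for sentence in split_sentences(note)
        if AMBULATION_PATTERN.search(sentence.lower())
    ]


# Invented example note, for illustration only.
note = "Patient walks with a rollator. Appetite is good. Gait remains unsteady!"
matches = ambulation_sentences(note)
```

In this sketch the rule operates on whole sentences, whereas a machine-learning model would typically receive the tokens (or features derived from them) as input.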

Software
The studies used different software (Table S2).4][55][56][57] For 4 studies it is unclear which software was used. 37,47,48,52 In addition, Javascript or a tool developed for NLP applications, including cTakes, GATE, MedTagger, and CRFSuite, was used. 39,42,45,46,51

Methods used to evaluate NLP system performance
Almost all studies used cross-validation or train-/test datasets to evaluate the NLP system's performance (Table S2). Six studies evaluated their NLP system using both cross-validation and train-/test datasets. 43,44,49,52,55,57 In 4 studies, an expert manually evaluated the performance of the NLP system. Five other studies used solely train-/test datasets (n = 5) and 4 other studies only cross-validation (n = 4). 50,53,54,56
The study by Goudarzvand et al used recent publications to evaluate their NLP system, 51 which was used for topic modeling; this was the only study in our review with this purpose. As topic modeling lacks a gold standard against which the outcome of the model can be compared, they validated the results against recent publications to verify whether meaningful outcomes were generated.
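As an illustration of the cross-validation set-up mentioned in this section, the sketch below splits a set of annotated notes into k folds so that every note is used for testing exactly once. It is a simplified example under stated assumptions, not the procedure of any reviewed study: the function name `k_fold_indices`, the number of notes, and the value of k are hypothetical, and only the Python standard library is used.

```python
import random


def k_fold_indices(n_items: int, k: int, seed: int = 0):
    """Yield (train_idx, test_idx) pairs; each item is in exactly one test fold."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    indices = list(range(n_items))
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# Hypothetical example: 10 annotated notes, 5-fold cross-validation.
splits = list(k_fold_indices(n_items=10, k=5))
```

A single train/test split corresponds to using only one of these pairs. With EHR notes it may also be sensible to assign all notes of one patient to the same fold, which this sketch does not do.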

Outcomes of the performance evaluation
More than half of the studies reported relatively high scores for the evaluation metrics (n = 12), 37,39,41,44,[47][48][49][50][52][53][54]56 indicating good performance by the NLP systems for that dataset and the purpose of the study (Table S2). This was particularly the case for systems extracting information on ambulating.
Other studies reported mixed performance outcomes (n = 7). 40,42,43,45,46,55,57 Some of the studies showed different outcomes when the results were stratified based on the type of ADL. For instance, Chen et al used a machine-learning model for classifying geriatric syndrome constructs. High scores were obtained for fecal control (F1 = 0.857) and walking difficulty (F1 = 0.758), but for severe urinary control issues low scores were obtained (F1 = 0.532). 42 Another study by Chen et al applied a deep-learning model for the same constructs and obtained comparably mixed scores. 46
Humbert-Droz et al found that scores varied depending on the method used to evaluate the NLP system. They evaluated NLP performance by comparing the outcome of the NLP tool with (1) a manual review, (2) structured EHR data, and (3) an external database. The highest scores for sensitivity, positive predictive value, and F1 scores were observed for the manual review, while the lowest scores were found in the comparison with structured EHR data. Humbert-Droz et al pointed out that this does not necessarily reflect the NLP system's performance. They encountered several issues with structured EHR data limiting their use as the gold standard in an evaluation. 40 Furthermore, mixed results were found when different approaches were compared. For example, Yang et al showed higher scores for a combined rule-based and deep-learning model, compared to the scores for each approach individually. They noted that this hybrid approach was better at leveraging the strengths of each approach and tackling challenges with regard to the dataset, including imbalanced data. 57
Authors of the studies in this systematic review identified several limitations concerning the NLP systems they developed. Generalizability to other healthcare sectors, practices, languages, patient groups, or data sources emerged as a significant challenge, 13,37,[39][40][41]43,47,49,[52][53][54][55][56][57] as the NLP systems were trained on datasets with specific characteristics. Another major challenge relates to the dataset on which the NLP system is trained and tested. Authors reported issues with small datasets due to factors such as restricted access to relevant EHR data, a small number of notes per patient due to a short hospital stay, or few patients in the study sample. 37,39,49,54 In addition, inadequate documentation and lack of granularity were mentioned. 41,[43][44][45]49,51,52,54,57

Principal findings
This systematic review provides a comprehensive overview of current research employing NLP to extract information on ADL from unstructured free-text notes in EHRs. Adequate information on ADL is important for care provision and research to ensure that individuals receive the necessary daily support. As information on ADL is often recorded in unstructured free-text EHR notes, NLP could be valuable for deriving this information. We explored 20 NLP systems described in 22 studies. Most studies (65%) utilized NLP for classifying unstructured EHR data on 1 or 2 ADL. Our findings show that a variety of NLP methods, algorithms, and pre-processing steps were used. There was a notable prevalence of deep-learning approaches. The majority of studies using deep learning also applied rule-based methods or machine learning. Evaluation of the NLP system's performance predominantly involved train-/test datasets and cross-validation.
The studies included in this review used a wide range of evaluation metrics, including F1, precision, and sensitivity. Despite the variety of NLP approaches and evaluation metrics, most studies reported relatively high overall scores on the evaluation metrics, indicating that the characteristics of the best-performing NLP system depend on study-specific factors.
The variability in models, approaches, and reporting complicates the direct comparison between the NLP systems and the quest for the best possible method.However, overall, the results of this review indicate that NLP systems are promising for research using unstructured EHR data on ADL for the following reasons.
First, the field of NLP is developing rapidly. It has evolved from rule-based methods to machine learning and deep learning. Compared to previous systematic reviews on the use of NLP for unstructured EHR notes, we included relatively more deep-learning approaches. 32,34,35 This shows that relatively new deep-learning algorithms, including transformers such as BERT, are being studied for NLP systems to extract information from unstructured clinical notes on ADL.
To improve the performance of the NLP system, often multiple approaches are compared or combined. Most studies adopted a hybrid approach by combining deep learning with rule-based or machine-learning algorithms in their final model. The possible benefits of hybrid approaches are also recognized by systematic reviews that focused on the application of NLP in other healthcare domains, including radiology, 61,62 clinical information in general, 31,34 and chronic diseases. 35 Hybrid approaches may be better able to address challenges related to the dataset, such as small or imbalanced datasets. Some of the studies included in this review encountered challenges with the datasets arising from how the information was recorded during healthcare provision, such as inadequate recordings or a low level of granularity, or because they did not have access to all relevant EHR data. These challenges are not unique to unstructured data but are also mentioned in the broader literature discussing data quality challenges in the use of EHR data for research (eg, 59,63).
Second, the characteristics of the best-performing NLP system depend on the context in which the dataset is generated, such as different EHR systems and different healthcare organizations. The studies included in this review that retrieved the data directly from an EHR system, rather than from a research database or registry, had access to data from a single organization or from organizations belonging to one medical group. It is expected that the NLP system will perform differently on datasets with other characteristics. NLP systems trained on datasets from multiple sources with different characteristics will have a higher external validity.
Third, a variety of metrics were used to evaluate the performance of the NLP systems. However, most studies evaluated the performance with train-/test datasets and cross-validation and reported F1 scores. Although the most appropriate evaluation metrics depend on the research aim, F1 scores are valuable in many cases, especially for classification purposes, which was the most prevalent purpose of the NLP systems in this review. Almost all F1 scores exceeded 0.7. This indicates that the methodologies used in developing the NLP systems, considering the characteristics of the specific dataset and research question of the study, are promising for generating information on ADL from unstructured EHR data.
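For readers less familiar with the metrics named in this review, the sketch below shows how precision (positive predictive value), sensitivity (recall), and the F1 score can be computed for a binary classification task, such as deciding whether a note mentions a walking difficulty. F1 is the harmonic mean of precision and sensitivity. The function name and the labels are invented for illustration and do not correspond to data from any reviewed study.

```python
def precision_sensitivity_f1(gold, predicted):
    """Compute precision, sensitivity, and F1 for binary labels (1 = ADL issue present)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # true positives
    fp = sum(1 for g, p in pairs if not g and p)    # false positives
    fn = sum(1 for g, p in pairs if g and not p)    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + sensitivity
    f1 = 2 * precision * sensitivity / denominator if denominator else 0.0
    return precision, sensitivity, f1


# Invented labels: manual annotation (gold) versus NLP output (predicted).
gold = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0]
precision, sensitivity, f1 = precision_sensitivity_f1(gold, predicted)
```

In this invented example there are 3 true positives, 1 false positive, and 1 false negative, so precision, sensitivity, and F1 are all 0.75.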

Strengths, limitations, and recommendations for further research
To the best of our knowledge, this is the first systematic review exploring NLP systems for extracting information on ADL from unstructured EHR data. A strength is that we used a broad search strategy in 5 different literature databases. However, the following limitations should be kept in mind. First, while ambulating and continence were covered by most studies, some ADL were only included in a few NLP systems. More research on NLP systems covering all 6 ADL is recommended. Second, some studies provided limited information on the algorithms, for example, with few details on the pre-processing. 5][66] Third, the field of NLP is developing rapidly. To keep up with the developments, it is recommended to conduct the search again in the near future.

Conclusion
The results of this systematic review indicate that NLP is a promising method for deriving information on ADL from unstructured EHR notes. Various NLP systems are already used in research and show overall good evaluation outcomes. Which NLP system will perform best depends on the characteristics of the dataset, research question, and type of ADL studied. Since there is no one-size-fits-all method, our findings suggest that research on ADL could benefit from an iterative process in which different NLP approaches are compared or combined based on the performance evaluation outcomes. Future developments in NLP for ADL extraction should focus on addressing generalizability issues and refining evaluation methodologies.

Figure 1. PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow diagram.

Table 1. Characteristics of the study and EHR data.

Table 2. Pre-processing steps used in the included studies.